{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 训练集数据预处理\n",
    "*说明：数据预处理阶段，生成的特征包括（18个）：*\n",
    "\n",
    "*出发机场、到达机场、航班编号、飞机编号、计划飞行时间、计划起飞时刻、航班月份、计划到达时刻、前序延误、起飞间隔、到达特情、出发特情、出发天气、出发气温、到达天气、到达气温、航空公司、航班性质。*\n",
    "\n",
    "**目录**\n",
    "1. 读取文件\n",
    "2. 时间信息预处理\n",
    "3. 前序航班的延误时间&到达与起飞间隔\n",
    "4. 优化内存&类型转换\n",
    "5. 特情\n",
    "6. 天气\n",
    " - 气温\n",
    " - 天气情况\n",
    " - 将天气匹配到航班动态表中\n",
    "7. 航班特征\n",
    "8. 优化内存&类型转换\n",
    "9. 将数据保存到文件中"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 读取文件\n",
    "**文件对应关系为：**\n",
    "\n",
    "1. flight.csv: 航班动态数据.csv\n",
    "2. weather.csv: 城市天气.csv\n",
    "3. airport.csv: 机场城市对应表.csv\n",
    "4. spcial.csv: 特情.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import os\n",
    "import datetime\n",
    "\n",
    "# 获取当前的工作目录\n",
    "pwd = os.getcwd()\n",
    "# 将工作目录更改到训练集\n",
    "os.chdir(\"原始训练集\")\n",
    "# ——————————————————————————读取数据—————————————————————————————— #\n",
    "# 航班数据\n",
    "flight_data = pd.read_table('flight.csv',sep=',',encoding='gb2312')\n",
    "# 天气数据\n",
    "\n",
    "weather = pd.read_csv('weather.csv')\n",
    "# 城市与机场对应数据\n",
    "airport_city = pd.read_excel('airport_city.xlsx')\n",
    "# 特情\n",
    "spcial = pd.read_excel('spcial.xlsx')\n",
    "# 改回原来的工作目录\n",
    "os.chdir(pwd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 读取后的航班动态展示"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>出发机场</th>\n",
       "      <th>到达机场</th>\n",
       "      <th>航班编号</th>\n",
       "      <th>计划起飞时间</th>\n",
       "      <th>计划到达时间</th>\n",
       "      <th>实际起飞时间</th>\n",
       "      <th>实际到达时间</th>\n",
       "      <th>飞机编号</th>\n",
       "      <th>航班是否取消</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>HGH</td>\n",
       "      <td>DLC</td>\n",
       "      <td>CZ6328</td>\n",
       "      <td>1453809600</td>\n",
       "      <td>1453817100</td>\n",
       "      <td>1.453813e+09</td>\n",
       "      <td>1.453819e+09</td>\n",
       "      <td>1.0</td>\n",
       "      <td>正常</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>SHA</td>\n",
       "      <td>XMN</td>\n",
       "      <td>FM9261</td>\n",
       "      <td>1452760800</td>\n",
       "      <td>1452767100</td>\n",
       "      <td>1.452763e+09</td>\n",
       "      <td>1.452768e+09</td>\n",
       "      <td>2.0</td>\n",
       "      <td>正常</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CAN</td>\n",
       "      <td>WNZ</td>\n",
       "      <td>ZH9597</td>\n",
       "      <td>1453800900</td>\n",
       "      <td>1453807500</td>\n",
       "      <td>1.453802e+09</td>\n",
       "      <td>1.453807e+09</td>\n",
       "      <td>3.0</td>\n",
       "      <td>正常</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>SHA</td>\n",
       "      <td>ZUH</td>\n",
       "      <td>9C8819</td>\n",
       "      <td>1452120600</td>\n",
       "      <td>1452131100</td>\n",
       "      <td>1.452121e+09</td>\n",
       "      <td>1.452130e+09</td>\n",
       "      <td>4.0</td>\n",
       "      <td>正常</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>SHE</td>\n",
       "      <td>TAO</td>\n",
       "      <td>TZ185</td>\n",
       "      <td>1452399000</td>\n",
       "      <td>1452406800</td>\n",
       "      <td>1.452400e+09</td>\n",
       "      <td>1.452404e+09</td>\n",
       "      <td>5.0</td>\n",
       "      <td>正常</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  出发机场 到达机场    航班编号      计划起飞时间      计划到达时间        实际起飞时间        实际到达时间  飞机编号  \\\n",
       "0  HGH  DLC  CZ6328  1453809600  1453817100  1.453813e+09  1.453819e+09   1.0   \n",
       "1  SHA  XMN  FM9261  1452760800  1452767100  1.452763e+09  1.452768e+09   2.0   \n",
       "2  CAN  WNZ  ZH9597  1453800900  1453807500  1.453802e+09  1.453807e+09   3.0   \n",
       "3  SHA  ZUH  9C8819  1452120600  1452131100  1.452121e+09  1.452130e+09   4.0   \n",
       "4  SHE  TAO   TZ185  1452399000  1452406800  1.452400e+09  1.452404e+09   5.0   \n",
       "\n",
       "  航班是否取消  \n",
       "0     正常  \n",
       "1     正常  \n",
       "2     正常  \n",
       "3     正常  \n",
       "4     正常  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "flight_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 时间信息预处理\n",
    "## 生成的特征&字段\n",
    "1. 计划飞行时间（单位：h）【可间接反映出航线距离】\n",
    "2. 计划起飞日期&计划到达日期【用以匹配天气和特情信息】\n",
    "3. 计划起飞时刻&计划到达时刻【不同时刻的航班】\n",
    "4. 航班月份\n",
    "5. 起飞延误时间（单位：h），取消的航班直接设置为延误10小时（大于3个小时即可）\n",
    "6. 到达延误时间（单位：h）【主要用作后续航班的一个参考】"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 转化成日期格式\n",
    "flight_data['计划起飞时间1'] = pd.to_datetime(flight_data['计划起飞时间'],unit='s',utc=True)\n",
    "flight_data['计划到达时间1'] = pd.to_datetime(flight_data['计划到达时间'],unit='s',utc=True)\n",
    "# 计划飞行时间\n",
    "flight_data['计划飞行时间'] = flight_data['计划到达时间1'] - flight_data['计划起飞时间1']\n",
    "flight_data['计划飞行时间'] = flight_data['计划飞行时间'].apply(lambda x: x.days * 86400 + x.seconds if not(pd.isnull(x)) else None)\n",
    "\n",
    "flight_data['计划飞行时间'] = flight_data['计划飞行时间']/3600  # 转换为小时\n",
    "# 细分时间段\n",
    "flight_data['计划起飞日期'] = flight_data['计划起飞时间1'].apply(lambda x:x.strftime('%Y-%m-%d') if not(pd.isnull(x)) else None)\n",
    "flight_data['计划起飞时刻'] = flight_data['计划起飞时间1'].apply(lambda x:x.strftime('%H') if not(pd.isnull(x)) else None)\n",
    "flight_data['航班月份'] = flight_data['计划起飞时间1'].apply(lambda x:int(x.strftime('%m')) if not(pd.isnull(x)) else None)\n",
    "\n",
    "flight_data['计划到达日期'] = flight_data['计划到达时间1'].apply(lambda x:x.strftime('%Y-%m-%d') if not(pd.isnull(x)) else None)\n",
    "flight_data['计划到达时刻'] = flight_data['计划到达时间1'].apply(lambda x:x.strftime('%H') if not(pd.isnull(x)) else None)\n",
    "# 延误\n",
    "flight_data['起飞延误时间'] = pd.to_datetime(flight_data['实际起飞时间'],unit='s',utc=True) - pd.to_datetime(flight_data['计划起飞时间'],unit='s',utc=True)\n",
    "flight_data['起飞延误时间'] = flight_data['起飞延误时间'].apply(lambda x: x.days * 86400 + x.seconds if not(pd.isnull(x)) else None)\n",
    "flight_data['起飞延误时间'] = flight_data['起飞延误时间']/3600  # 转换为小时\n",
    "flight_data['起飞延误时间'] = np.where(flight_data['航班是否取消'] == '取消',10,flight_data['起飞延误时间'])\n",
    "\n",
    "\n",
    "flight_data['到达延误时间'] = pd.to_datetime(flight_data['实际到达时间'],unit='s',utc=True) - pd.to_datetime(flight_data['计划到达时间'],unit='s',utc=True)\n",
    "flight_data['到达延误时间'] = flight_data['到达延误时间'].apply(lambda x: x.days * 86400 + x.seconds if not(pd.isnull(x)) else None)\n",
    "flight_data['到达延误时间'] = flight_data['到达延误时间']/3600  # 转换为小时\n",
    "flight_data['到达延误时间'] = np.where(flight_data['航班是否取消'] == '取消',10,flight_data['到达延误时间'])\n",
    "\n",
    "# 删除后面用不着的字段，节约内存\n",
    "del flight_data['计划起飞时间1']\n",
    "del flight_data['计划到达时间1']\n",
    "del(flight_data['实际起飞时间'])\n",
    "del(flight_data['航班是否取消'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 前序航班的延误时间&到达与起飞间隔\n",
    "这里前序航班的定义为：同一架飞机，当前航班的前一个航班即为当前航班的前序航班。比如，同一架飞机连续飞两个航班A：南京--北京，B：北京--西安，则A为B的前序航班。\n",
    "## 延误时间（单位：h）\n",
    "前序航班的延误时间定义为前序航班到达延误时间，即时间到达时间减去计划到达时间\n",
    "## 到达与起飞间隔（单位：h）\n",
    "当前航班的计划起飞时间与前序航班时间到达时间的间隔"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 飞机编号有缺失值的统一编号为‘0’\n",
    "flight_data['飞机编号']= flight_data['飞机编号'].fillna(0)\n",
    "flight_data['前序延误'] = pd.Series()\n",
    "flight_data['起飞间隔'] = pd.Series()\n",
    "# 根据航班的飞机编号分组\n",
    "grouped = flight_data.groupby(flight_data['飞机编号'])\n",
    "chunks = []\n",
    "for name,group in grouped:\n",
    "#     print(name)\n",
    "# 对每一架飞机按照计划起飞时间排序，以得到航班的顺序关系\n",
    "    group = group.sort_values('计划起飞时间')\n",
    "    a = pd.to_datetime(group['计划起飞时间'],unit='s',utc=True)[1:].reset_index(drop=True)\n",
    "    b = pd.to_datetime(group['实际到达时间'],unit='s',utc=True)[0:len(group)-1].reset_index(drop=True)  \n",
    "    group['起飞间隔'][1:] = a-b\n",
    "    group['起飞间隔'] = group['起飞间隔'].apply(lambda x: x.days * 86400 + x.seconds if not(pd.isnull(x)) else None)\n",
    "    group['起飞间隔'] = group['起飞间隔']/3600\n",
    "#     前序航班的延误时间\n",
    "    group['前序延误'][1:] = group['到达延误时间'][0:len(group)-1]\n",
    "    chunks.append(group)\n",
    "flight_data = pd.concat(chunks, ignore_index=True)\n",
    "# 删除无用的变量和字段，释放内存\n",
    "del(grouped)\n",
    "del(group)\n",
    "del(chunks)\n",
    "del(flight_data['计划起飞时间'])\n",
    "del(flight_data['计划到达时间'])\n",
    "del(flight_data['实际到达时间'])\n",
    "# 因为编号为0的飞机其实是缺失值，真实情况下的航班可能并没有顺序关系，所以暂时将其设置为NAN\n",
    "flight_data['前序延误'][flight_data['飞机编号']==0] = np.NaN\n",
    "flight_data['起飞间隔'][flight_data['飞机编号']==0] = np.NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 优化内存&类型转换\n",
    "**由于笔记本内存只有4G，所以时刻想着优化内存，以免死机**\n",
    "也包括将部分字符型特征的转换为数值型特征\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "flight_data['出发机场'] = flight_data['出发机场'].astype('category')\n",
    "flight_data['到达机场'] = flight_data['到达机场'].astype('category')\n",
    "flight_data['航班编号'] = flight_data['航班编号'].astype('category')\n",
    "flight_data['飞机编号'] = pd.to_numeric(flight_data['飞机编号'],errors='ignore',downcast='float')\n",
    "flight_data['起飞延误时间'] = pd.to_numeric(flight_data['起飞延误时间'],errors='ignore',downcast='float')\n",
    "flight_data['到达延误时间'] = pd.to_numeric(flight_data['到达延误时间'],errors='ignore',downcast='float')\n",
    "flight_data['计划飞行时间'] = pd.to_numeric(flight_data['计划飞行时间'],errors='ignore',downcast='float')\n",
    "flight_data['计划起飞时刻'] = pd.to_numeric(flight_data['计划起飞时刻'],errors='ignore',downcast='unsigned')\n",
    "flight_data['计划到达时刻'] = pd.to_numeric(flight_data['计划到达时刻'],errors='ignore',downcast='unsigned')\n",
    "flight_data['航班月份'] = pd.to_numeric(flight_data['航班月份'],errors='ignore',downcast='unsigned')\n",
    "flight_data['计划起飞日期'] = flight_data['计划起飞日期'].astype('category')\n",
    "flight_data['计划到达日期'] = flight_data['计划到达日期'].astype('category')\n",
    "flight_data['前序延误'] = pd.to_numeric(flight_data['前序延误'],errors='ignore',downcast='float')\n",
    "flight_data['起飞间隔'] = pd.to_numeric(flight_data['起飞间隔'],errors='ignore',downcast='float')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 特情\n",
    "特情这个特征的处理，没有区分特情的具体内容，只将特情发生的时间段对应到计划起飞和到达的时间，以0代表没有发生特情，1表示发生了特情，所以后面有继续优化这个特征的空间"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "del(spcial['收集时间'])\n",
    "del(spcial['特情内容'])\n",
    "# 将机场的代码转换为大写\n",
    "for i in spcial.index:\n",
    "    s = str(spcial['特情机场'][i])\n",
    "    spcial['特情机场'][i] = s.upper()\n",
    "#  提取出特情发生的开始结束日期和时刻\n",
    "spcial['开始日期'] = spcial['开始时间'].apply(lambda x :str(x).split(' ')[0])\n",
    "spcial['开始时刻'] = pd.Series()\n",
    "spcial['结束时刻'] = pd.Series()\n",
    "for i in spcial.index:\n",
    "    if not(pd.isnull(spcial['开始时间'][i])):\n",
    "        spcial['开始时刻'][i] = int(str(spcial['开始时间'][i]).split(' ')[1][:2])\n",
    "    else :\n",
    "        spcial['开始时刻'][i] = np.NaN\n",
    "    if not(pd.isnull(spcial['结束时间'][i])):\n",
    "        spcial['结束时刻'][i] = int(str(spcial['结束时间'][i]).split(' ')[1][:2])\n",
    "    else :\n",
    "        spcial['结束时刻'][i] = np.NaN\n",
    "del(spcial['开始时间'])\n",
    "del(spcial['结束时间'])\n",
    "# 去重，避免左连接时发生多余样本\n",
    "spcial = spcial.drop_duplicates(['特情机场','开始日期'])\n",
    "# 到达城市特情\n",
    "flight_data = pd.merge(flight_data,spcial,left_on=['到达机场','计划到达日期'],right_on=['特情机场','开始日期'],how='left',sort=False)\n",
    "# 落在相应的时间段即为有特情\n",
    "flight_data['到达特情'] = np.where((flight_data['计划到达时刻'] >=flight_data['开始时刻']) &\n",
    "                               (flight_data['计划到达时刻']<= flight_data['结束时刻']),1,0)\n",
    "del(flight_data['特情机场'])\n",
    "del(flight_data['开始日期'] )\n",
    "del(flight_data['开始时刻'])\n",
    "del(flight_data['结束时刻'] )\n",
    "# 出发城市特情\n",
    "flight_data = pd.merge(flight_data,spcial,left_on=['出发机场','计划起飞日期'],right_on=['特情机场','开始日期'],how='left',sort=False)\n",
    "# 落在相应的时间段即为有特情\n",
    "flight_data['出发特情'] = np.where((flight_data['计划起飞时刻'] >=flight_data['开始时刻']) &\n",
    "                               (flight_data['计划起飞时刻']<= flight_data['结束时刻']),1,0)\n",
    "# 删除无用字段\n",
    "del(flight_data['特情机场'])\n",
    "del(flight_data['开始日期'] )\n",
    "del(flight_data['开始时刻'])\n",
    "del(flight_data['结束时刻'] )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 天气\n",
    "天气特征的提取主要包括气温特征和天气情况，其中：\n",
    "\n",
    "1. 气温划分为3个取值，大于40度为高温，小于-10度为低温，其他为一般\n",
    "2. 天气情况（小雨、阴天等）根据城市天气表，首先统计两年时间内所有天气情况在各地区总共出现的频率，出现频率小于50的天气情况统一划归为‘other’。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 气温"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 气温处理\n",
    "weather['气温'] = pd.Series()\n",
    "weather['最高气温'][weather['最高气温']==' ']=  np.NaN\n",
    "weather['最低气温'][weather['最低气温']==' ']=  np.NaN\n",
    "weather['最高气温'] = weather['最高气温'].fillna('0')\n",
    "weather['最高气温'] = weather['最高气温'].apply(lambda x : str(x).strip())\n",
    "weather['最高气温'] = weather['最高气温'].apply(lambda x : str(x).split('.')[0])\n",
    "weather['最高气温'] = weather['最高气温'].apply(lambda x : str(x).split(' ')[0])\n",
    "weather['最低气温'] = weather['最低气温'].fillna('0')\n",
    "weather['最低气温'] = weather['最低气温'].apply(lambda x : str(x).strip())\n",
    "weather['最低气温'] = weather['最低气温'].apply(lambda x : str(x).split('.')[0])\n",
    "weather['最低气温'] = weather['最低气温'].apply(lambda x : str(x).split(' ')[0])\n",
    "weather['气温'] = weather['最高气温'].apply(lambda x: '高温' if int(x)>=40 else '一般')\n",
    "weather['气温'] = np.where(weather['最低气温'].astype('int') < -10,'低温',weather['气温'])\n",
    "del(weather['Unnamed: 5'])\n",
    "del(weather['最高气温'])\n",
    "del(weather['最低气温'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 天气情况\n",
    "**注意：这里把频率小于50的天气设置为‘other’后，当前所有的天气情况需要保存下来，用以在测试集中天气情况的匹配**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "weather = weather.drop_duplicates() # 只包含这三个字段\n",
    "case = weather['天气'].value_counts()\n",
    "case = case[case>=50]\n",
    "weather_case = set(case.index)\n",
    "weather['天气'] = weather['天气'].apply(lambda x: x if x in weather_case else 'other')\n",
    "# 将机场编码对应到天气数据上面，根据城市名\n",
    "airport_weather = pd.merge(weather,airport_city,left_on=['城市'],right_on=['城市名称'],how='left',sort=False)\n",
    "del(airport_weather['城市名称'])\n",
    "# 去除缺失值和重复的机场天气信息\n",
    "airport_weather = airport_weather.dropna()\n",
    "airport_weather = airport_weather.drop_duplicates(['日期','机场编码'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 将天气匹配到航班动态表中"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 出发城市\n",
    "flight_data = pd.merge(flight_data,airport_weather,left_on=['出发机场','计划起飞日期'],right_on=['机场编码','日期'],how='left')\n",
    "flight_data['出发天气'] = flight_data['天气']\n",
    "flight_data['出发气温'] = flight_data['气温']\n",
    "del(flight_data['天气'])\n",
    "del(flight_data['机场编码'])\n",
    "del(flight_data['城市'])\n",
    "del[flight_data['日期']]\n",
    "del(flight_data['气温'])\n",
    "# 到达城市\n",
    "flight_data = pd.merge(flight_data,airport_weather,left_on=['到达机场','计划到达日期'],right_on=['机场编码','日期'],how='left')\n",
    "flight_data['到达天气'] = flight_data['天气']\n",
    "flight_data['到达气温'] = flight_data['气温']\n",
    "del(flight_data['天气'])\n",
    "del(flight_data['机场编码'])\n",
    "del(flight_data['城市'])\n",
    "del[flight_data['日期']]\n",
    "del(flight_data['气温'])\n",
    "del(flight_data['计划起飞日期'])\n",
    "del(flight_data['计划到达日期'])\n",
    "del(flight_data['到达延误时间'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 航班特征\n",
    "这里根据航班编号提取出的特征包括：\n",
    "\n",
    "1. 航空公司\n",
    "2. 航班的性质（摘自百度百科，一般尾号为字母的是补飞的，3位数字国内航班，4为数字国外航班）\n",
    " - 0：补飞\n",
    " - 1：国内正常\n",
    " - 2：国外"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "flight_data['航空公司'] = flight_data['航班编号'].apply(lambda x : x[:2])\n",
    "def f(x):\n",
    "    if x[-1].isalpha():\n",
    "        y = 0\n",
    "    elif len(x[2:]) == 4:\n",
    "        y = 1\n",
    "    else:\n",
    "        y = 2\n",
    "    return(y)\n",
    "flight_data['航班性质'] = flight_data['航班编号'].apply(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 内存再优化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "flight_data['出发天气'] = flight_data['出发天气'].astype('category')\n",
    "flight_data['到达天气'] = flight_data['到达天气'].astype('category')\n",
    "flight_data['出发气温'] = flight_data['出发气温'].astype('category')\n",
    "flight_data['到达气温'] = flight_data['到达气温'].astype('category')\n",
    "flight_data['航空公司'] = flight_data['航空公司'].astype('category')\n",
    "flight_data['前序延误'] = pd.to_numeric(flight_data['前序延误'],errors='ignore',downcast='float')\n",
    "flight_data['到达特情'] = pd.to_numeric(flight_data['到达特情'],errors='ignore',downcast='unsigned')\n",
    "flight_data['出发特情'] = pd.to_numeric(flight_data['出发特情'],errors='ignore',downcast='unsigned')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>出发机场</th>\n",
       "      <th>到达机场</th>\n",
       "      <th>航班编号</th>\n",
       "      <th>飞机编号</th>\n",
       "      <th>计划飞行时间</th>\n",
       "      <th>计划起飞时刻</th>\n",
       "      <th>航班月份</th>\n",
       "      <th>计划到达时刻</th>\n",
       "      <th>起飞延误时间</th>\n",
       "      <th>前序延误</th>\n",
       "      <th>起飞间隔</th>\n",
       "      <th>到达特情</th>\n",
       "      <th>出发特情</th>\n",
       "      <th>出发天气</th>\n",
       "      <th>出发气温</th>\n",
       "      <th>到达天气</th>\n",
       "      <th>到达气温</th>\n",
       "      <th>航空公司</th>\n",
       "      <th>航班性质</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>YNZ</td>\n",
       "      <td>TYN</td>\n",
       "      <td>GJ8831</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>多云</td>\n",
       "      <td>一般</td>\n",
       "      <td>晴</td>\n",
       "      <td>一般</td>\n",
       "      <td>GJ</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>PEK</td>\n",
       "      <td>HGH</td>\n",
       "      <td>FM9152</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.166667</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>7</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>雾</td>\n",
       "      <td>一般</td>\n",
       "      <td>多云</td>\n",
       "      <td>一般</td>\n",
       "      <td>FM</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>PEK</td>\n",
       "      <td>XIY</td>\n",
       "      <td>CA1205</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.250000</td>\n",
       "      <td>12</td>\n",
       "      <td>1</td>\n",
       "      <td>14</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>雾</td>\n",
       "      <td>一般</td>\n",
       "      <td>雾</td>\n",
       "      <td>一般</td>\n",
       "      <td>CA</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CTU</td>\n",
       "      <td>TAO</td>\n",
       "      <td>CA4511</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.750000</td>\n",
       "      <td>22</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>晴</td>\n",
       "      <td>一般</td>\n",
       "      <td>晴转雾</td>\n",
       "      <td>一般</td>\n",
       "      <td>CA</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>PEK</td>\n",
       "      <td>CTU</td>\n",
       "      <td>CA4102</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.083333</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>雾</td>\n",
       "      <td>一般</td>\n",
       "      <td>晴转阴</td>\n",
       "      <td>一般</td>\n",
       "      <td>CA</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  出发机场 到达机场    航班编号  飞机编号    计划飞行时间  计划起飞时刻  航班月份  计划到达时刻  起飞延误时间  前序延误  起飞间隔  \\\n",
       "0  YNZ  TYN  GJ8831   0.0  2.000000       1     1       3    10.0   NaN   NaN   \n",
       "1  PEK  HGH  FM9152   0.0  2.166667       5     1       7    10.0   NaN   NaN   \n",
       "2  PEK  XIY  CA1205   0.0  2.250000      12     1      14    10.0   NaN   NaN   \n",
       "3  CTU  TAO  CA4511   0.0  2.750000      22     1       1    10.0   NaN   NaN   \n",
       "4  PEK  CTU  CA4102   0.0  3.083333       2     1       5    10.0   NaN   NaN   \n",
       "\n",
       "   到达特情  出发特情 出发天气 出发气温 到达天气 到达气温 航空公司  航班性质  \n",
       "0     0     0   多云   一般    晴   一般   GJ     1  \n",
       "1     0     0    雾   一般   多云   一般   FM     1  \n",
       "2     0     0    雾   一般    雾   一般   CA     1  \n",
       "3     0     0    晴   一般  晴转雾   一般   CA     1  \n",
       "4     0     0    雾   一般  晴转阴   一般   CA     1  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "flight_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 将数据保存到文件中"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# train_data1.csv和columns_types.pickle保存在\"处理后训练集\"文件夹下\n",
    "# weather_case.csv 保存在原始测试集文件夹下\n",
    "import pickle\n",
    "os.chdir(\"处理后训练集\")\n",
    "flight_data.to_csv('train_data1.csv',index=False) # 保存训练数据\n",
    "# 训练集数类型\n",
    "colums_types = dict(flight_data.dtypes)\n",
    "with open('colums_type.pickle', 'wb') as f:\n",
    "     pickle.dump(colums_types, f) \n",
    "os.chdir(pwd)\n",
    "os.chdir(\"原始测试集\")\n",
    "pd.Series(list(weather_case)).to_csv('weather_case.csv',index=False,header=True) # 保存天气情况数据\n",
    "os.chdir(pwd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# n,m = [int(x) for x in input().split()]\n",
    "n ,m = 81,4\n",
    "consum = 0\n",
    "for i in range(m):\n",
    "    consum += n\n",
    "    n = np.sqrt(n)\n",
    "print(\".2f\",consum)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
